Title: ProGreSS: Simultaneous Searching of Protein Databases by Sequence and Structure
PSB Session: Joint Learning from Multiple Types of Genomic Data

Authors

  • Arnab Bhattacharya
  • Tolga Can
  • Tamer Kahveci
  • Ambuj K. Singh
  • Yuan-Fang Wang
Abstract

We consider the problem of similarity search on protein databases using sequence and structure information simultaneously. Our program extracts feature vectors from both the sequence and the structure components of each protein. These feature vectors are then combined and indexed using a novel multi-dimensional index structure. For a given query, we use this index structure to find candidate matches in the database. We develop a new method for computing the statistical significance of these candidates. The candidates with high significance are then aligned to the query protein using the Smith-Waterman technique to find the optimal alignment. The experimental results show that our method classifies up to 97% of the superfamilies and up to 100% of the classes correctly according to the SCOP classification. Our method is up to 37 times faster than CTSS, a recent structure search technique, combined with the Smith-Waterman technique for sequences.
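The final step of the pipeline described above aligns high-significance candidates to the query with the Smith-Waterman technique. The following is a minimal sketch of generic Smith-Waterman local alignment, not the authors' implementation. The function name `smith_waterman`, the linear gap penalty, and the match/mismatch scores are illustrative assumptions; a protein search tool would more plausibly score residue pairs with a substitution matrix.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of strings a and b with a linear gap penalty.

    Scoring values are illustrative assumptions, not taken from the paper.
    Returns (best_score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh local alignment
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("XXACGTXX", "YYACGTYY")
    print(score)
    print(top)
    print(bottom)
```

Because the dynamic-programming table is quadratic in the sequence lengths, running this alignment only on candidates that pass the index and significance filters is what keeps the overall search fast.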


Similar resources


Decision Tree Based Information Integration for Automated Protein Classification

We propose a novel technique for automatically generating the SCOP classification of a protein structure with high accuracy. We achieve accurate classification by combining the decisions of multiple methods using the consensus of a committee (or an ensemble) classifier. Our technique, based on decision trees, is rooted in machine learning which shows that by judicially employing component class...


Efficient and Automated Analysis of Protein Structures

By Tolga Can. In recent years, computational complexity in structural bioinformatics attained a new level with the vast increase in the amount of structural data available. The Protein Data Bank (PDB), which is the single worldwide repository for 3-D macromolecular structure data, contains more than 25k structures as of July 2004. However, e...


Protein Secondary Structure Prediction: a Literature Review with Focus on Machine Learning Approaches

DNA sequence, containing all genetic traits is not a functional entity. Instead, it transfers to protein sequences by transcription and translation processes. This protein sequence takes on a 3D structure later, which is a functional unit and can manage biological interactions using the information encoded in DNA. Every life process one can figure is undertaken by proteins with specific functio...


INSTRUCT - Space-Efficient Structure for Indexing and Complete Query Management of String Databases

The tremendous expanse of search engines, dictionary and thesaurus storage, and other text mining applications, combined with the popularity of readily available scanning devices and optical character recognition tools, has necessitated efficient storage, retrieval and management of massive text databases for various modern applications. For such applications, we propose a novel data structure,...



Publication date: 2004